Data acquisition is the first step of any data science pipeline, and one of the most important.
The following code snippet shows how the data for this project is retrieved. The function game_id_generator() generates the list of game IDs for a given season. The class DataDownloader() retrieves the data for every game of a given season and stores it as JSON.
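def game_id_generator(year: int) -> list[str]:
    """Return every candidate game ID for the season that starts in `year`."""
    year = str(year)
    # Number of regular-season games in the season
    total_games = 1230 if year == '2016' else 1271
    ids = []
    # Regular season: <year> + '02' + zero-padded four-digit game number
    for j in range(1, total_games + 1):
        ids.append(year + '02' + '{:04d}'.format(j))
    # Playoffs: <year> + '03' + '0' + three digits (i, j, k).
    # This enumerates more IDs than there are playoff games.
    for i in range(1, 10):
        for j in range(1, 10):
            for k in range(1, 8):
                ids.append(year + '030' + str(i) + str(j) + str(k))
    return ids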
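To make the ID layout concrete, here is a minimal sketch that reads an ID back into its parts. The helper parse_game_id() is hypothetical and not part of the project. Reading the three playoff digits as round, matchup, and game is an assumption, since the generator above only calls them i, j, and k.

```python
def parse_game_id(game_id: str) -> dict:
    """Split a 10-character game ID into its parts (hypothetical helper)."""
    season, game_type, number = game_id[:4], game_id[4:6], game_id[6:]
    if game_type == '02':
        # Regular season: the last four digits are the game number
        return {'season': int(season), 'type': 'regular', 'game': int(number)}
    if game_type == '03':
        # Playoffs: '0' followed by three digits, assumed here to be
        # round, matchup, and game within the series
        return {'season': int(season), 'type': 'playoffs',
                'round': int(number[1]), 'matchup': int(number[2]),
                'game': int(number[3])}
    raise ValueError(f'unexpected game type {game_type!r} in {game_id!r}')


print(parse_game_id('2016020001'))
print(parse_game_id('2016030117'))
```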
The file data_downloader.py contains the class DataDownloader(), which retrieves the data for a given season and stores it as JSON. The class has the following key methods:
__init__(self, path: str|None, rewrite: bool = False, threaded: bool = True, worker: int = 10, logger_path: str|None = None, log_level: int|None = logging.INFO): The constructor of the class. It takes the path to the directory where the data will be stored, a boolean indicating whether to overwrite data that already exists, a boolean indicating whether to use multithreading, the number of threads to use, the path to the log file, and the log level. In the signature shown, path has no default; the other parameters default to False, True, 10, None, and logging.INFO respectively.
download(self, year: int) -> None: Downloads the data for a given year. It takes the year as an argument and returns None.
A major feature of the DataDownloader class is that it can download the data in parallel. This is achieved with the threading module.
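As a rough illustration of parallel downloading with the threading module, the sketch below drains a queue of game IDs with a fixed number of worker threads. It is not the DataDownloader implementation: download_all(), fetch, and workers are hypothetical names, and fetch stands in for whatever function performs the actual HTTP request.

```python
import threading
from queue import Empty, Queue
from typing import Callable


def download_all(game_ids: list[str],
                 fetch: Callable[[str], dict],
                 workers: int = 10) -> dict[str, dict]:
    """Fetch every game ID using a fixed number of worker threads."""
    tasks: Queue = Queue()
    for game_id in game_ids:
        tasks.put(game_id)

    results: dict[str, dict] = {}
    lock = threading.Lock()

    def work() -> None:
        while True:
            try:
                game_id = tasks.get_nowait()
            except Empty:
                return  # queue drained: this worker is done
            try:
                payload = fetch(game_id)
            except Exception:
                continue  # skip IDs that fail, e.g. a game that was never played
            with lock:
                results[game_id] = payload

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# Usage with a stand-in fetch function (no network access needed)
def fake_fetch(game_id: str) -> dict:
    return {'gamePk': int(game_id)}


games = download_all(['2016020001', '2016020002'], fake_fetch, workers=2)
print(sorted(games))
```

Passing fetch as an argument keeps the threading logic separate from the request code, which also makes it easy to test without touching the network.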

The screenshot displays some information about a specific game and an event within it, all of which can be dynamically configured using four interactive widgets. Below is the implementation of the tool.
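import json
import matplotlib.pyplot as plt
import ipywidgets as widgets
from ipywidgets import interactive
from IPython.display import display

# getFiles(), read_data() and rink_image_np are defined elsewhere in the project.
# Assumed mapping, consistent with the IDs built by game_id_generator()
game_type_digits = {'Regular': '02', 'Playoffs': '03'}
files = getFiles('201602')
data = read_data(files[0])

# Initialize widgets
seasons = widgets.Dropdown(
    options=['2016', '2017', '2018', '2019', '2020'], description='Season:')
game_type = widgets.Dropdown(
    options=['Regular', 'Playoffs'], description='Game Type:')
game_id_slider = widgets.IntSlider(
    min=1, max=len(files), step=1, description='Game ID:')
event_slider = widgets.IntSlider(
    min=1, max=len(data['liveData']['plays']['allPlays']), step=1,
    description='Event:')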
def update_game_id_slider(*args):
    """Reload the file list when the season or game type changes."""
    global files
    files = getFiles(f'{seasons.value}{game_type_digits[game_type.value]}')
    game_id_slider.value = 1
    game_id_slider.max = len(files)
    update_event_slider()

def update_event_slider(*args):
    """Load the selected game and resize the event slider to match it."""
    global data
    data = read_data(files[game_id_slider.value - 1])
    event_count = len(data['liveData']['plays']['allPlays'])
    if event_count:
        event_slider.max = event_count
        event_slider.value = 1
        event_slider.min = 1
    else:
        # No events: lower min first so the slider can settle at 0
        event_slider.min = 0
        event_slider.value = 0
        event_slider.max = 0
def update_event_plot(season, game_type, game_id, event_index):
    """Print a summary of the selected event and plot it on the rink."""
    events = data['liveData']['plays']['allPlays']
    if not events:
        print('No event')
        return
    print("gameId: ", data['gamePk'])
    home = data['liveData']['linescore']['teams']['home']['team']['abbreviation']
    away = data['liveData']['linescore']['teams']['away']['team']['abbreviation']
    print(f'{home} vs. {away}')
    event_data = events[event_index - 1]
    coordinates = event_data['coordinates']
    if not coordinates:
        # Events without coordinates are dumped as raw JSON instead of plotted
        print(json.dumps(event_data, indent=4))
        return
    period = event_data['about']['period']
    t = [i for i in data['liveData']['linescore']['periods']
         if i['num'] == period]
    # Fall back to 'right' when the period or its rinkSide is missing
    # (an arbitrary default that avoids an undefined isHomeOnRight)
    side = t[0]['home'].get('rinkSide', 'right') if t else 'right'
    isHomeOnRight = 1 if side == 'right' else -1
    summary = f"Event: {event_data['result']['event']}\nPeriod: {event_data['about']['period']}\nTime: {event_data['about']['periodTime']}\nTeam: {event_data['team']['name']}"
    print(summary)
    plt.title(event_data['result']['description'], y=1.1)
    # The rink is 200 ft long and 85 ft wide, centered on (0, 0)
    plt.imshow(rink_image_np, extent=[-100, 100, -42.5, 42.5])
    plt.ylim(-42.5, 42.5)
    plt.xlim(-100, 100)
    plt.xticks([-100.0, -75.0, -50.0, -25.0, 0.0, 25.0, 50.0, 75.0, 100.0])
    plt.yticks([-42.5, -21.25, 0, 21.25, 42.5])
    plt.scatter(coordinates['x'], coordinates['y'])
    # Team labels above the rink, on the side each team occupies
    plt.text(isHomeOnRight * (-75), 47, away, ha='center',
             va='center', fontsize=12)
    plt.text(isHomeOnRight * 75, 47, home, ha='center',
             va='center', fontsize=12)
    plt.xlabel("Feet")
    plt.ylabel("Feet")
    plt.show()

# Keep the sliders in sync with the dropdowns
seasons.observe(update_game_id_slider, 'value')
game_type.observe(update_game_id_slider, 'value')
game_id_slider.observe(update_event_slider, 'value')
# Create interactive plot
interactive_plot = interactive(
    update_event_plot, season=seasons, game_type=game_type,
    game_id=game_id_slider, event_index=event_slider)
output = interactive_plot.children[-1]
output.layout.height = '450px'
display(interactive_plot)
First 10 rows of tidied dataframe:

Assume each penalty event comes with a start time X, a duration T, and the penalized team A. For any event occurring between X and X + T, team A has at least one fewer player on the ice than it had at the last event before X. The same principle applies to penalties against the opposing team. We keep a record of the number of players on each team from the start of the game (typically 5-5) until its end. From that record, we can deduce the on-ice strength for every shot and goal that falls between X and X + T, based on the team executing the event.
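The reasoning above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: Penalty, skaters_on_ice(), and strength() are hypothetical names, times are in seconds from the start of the game, and the sketch ignores penalties that end early after a power-play goal, coincidental penalties, and misconducts.

```python
from dataclasses import dataclass


@dataclass
class Penalty:
    team: str      # penalized team (A)
    start: int     # start time in seconds from the start of the game (X)
    duration: int  # duration in seconds (T)


def skaters_on_ice(penalties: list[Penalty], team: str, t: int,
                   full_strength: int = 5) -> int:
    """Number of skaters `team` has on the ice at time `t`."""
    active = sum(1 for p in penalties
                 if p.team == team and p.start <= t < p.start + p.duration)
    # NHL rules keep at least three skaters per side on the ice
    return max(full_strength - active, 3)


def strength(penalties: list[Penalty], shooter: str, opponent: str,
             t: int) -> str:
    """Strength label from the shooting team's point of view."""
    own = skaters_on_ice(penalties, shooter, t)
    other = skaters_on_ice(penalties, opponent, t)
    return f'{own}v{other}'


penalties = [Penalty('MTL', 600, 120), Penalty('TOR', 650, 120)]
print(strength(penalties, 'TOR', 'MTL', 620))  # 5v4: only MTL is penalized
print(strength(penalties, 'TOR', 'MTL', 700))  # 4v4: both penalties overlap
print(strength(penalties, 'TOR', 'MTL', 800))  # 5v5: both penalties expired
```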
Real-time performance analysis enables a detailed examination of both team and player behaviors during a game. We will incorporate three metrics for each team, calculated from the start of the game up to each event:
From these plots, you can infer the playstyles of different teams in a given season. Zones of excess shots (darkest red) show where a team typically shoots from and how close to the goal that is. You can also see which side a team favors, which might be influenced by, for example, whether its shooters are right- or left-handed. Looking at the overall picture, you can also draw conclusions about the average shot rate: a map that is blue or red across the board indicates that the team shoots, on average, less or more than the league average, respectively. With these figures for multiple seasons, you can also track how the playstyles of individual teams and of the league as a whole evolve over the years.
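To make the idea of "excess shots" concrete, the sketch below bins shot coordinates on a grid over the offensive zone and subtracts the league-average rate from the team's rate. It rests on several assumptions: the function names are hypothetical, rates are normalised per game, coordinates are in feet with x measured from center ice, and no smoothing is applied, unlike the maps discussed here.

```python
import numpy as np


def shot_rate_grid(x, y, n_games, bins=(20, 17)):
    """Shots per game in each grid cell of the offensive half of the rink.

    x: distance from center ice along the rink, in feet (0 to 100)
    y: position across the rink, in feet (-42.5 to 42.5)
    """
    counts, _, _ = np.histogram2d(
        x, y, bins=bins, range=[[0, 100], [-42.5, 42.5]])
    return counts / n_games


def excess_shot_rate(team_xy, team_games, league_xy, league_team_games,
                     bins=(20, 17)):
    """Team rate minus league-average rate: positive is red, negative is blue."""
    team = shot_rate_grid(*team_xy, team_games, bins=bins)
    # The league average divides all shots by the number of team-games played
    league = shot_rate_grid(*league_xy, league_team_games, bins=bins)
    return team - league


# Toy data: one team with 3 shots in 1 game, a league with 4 shots in 2 team-games
team_xy = ([10, 10, 90], [0, 0, 0])
league_xy = ([10, 90, 10, 90], [0, 0, 0, 0])
print(excess_shot_rate(team_xy, 1, league_xy, 2, bins=(2, 1)))
```

In the toy data the team takes one more shot per game than the league average in the half of the zone nearest center ice, and exactly the league average in the half nearest the goal.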
Upon examining the two shot maps, we can see that the Colorado Avalanche were significantly more active in the 2020-2021 season than in the 2016-2017 season. In 2016-2017, the team was notably more active on the left side of the offensive zone but shot less than the league average in the middle of the offensive zone. In 2020-2021, they were slightly less active near the goal but much more engaged in the middle, with a broad region of red between 20 and 60 feet from the center of the rink. The Colorado Avalanche finished first in the league in 2020-2021, whereas they finished 30th in 2016-2017. What appears to be a change in playstyle, characterized by more shots from further out, may have contributed to a better standing. However, these observations must be treated with caution, because each map compares the team against the league average for its own season. We therefore cannot be certain that the team shot more in 2020-2021 than in 2016-2017, only that it shot more relative to that year’s average.
Analyzing such intricate maps can be challenging when examining numerous figures simultaneously. Comparing the teams season by season, the Tampa Bay Lightning appear to have a higher shot rate, on average, than the Buffalo Sabres. In the plot for 2018-2019, for instance, Tampa Bay had a higher shot rate in the two faceoff circles closest to the goal. The Tampa Bay Lightning also seem to shoot more from the right side, suggesting that a strong right winger might contribute to their success. Neither team attempts many tip-ins (blue area around the goal in every season), and except for the 2019-2020 season this is much more marked for the Buffalo Sabres. When creating these figures and selecting smoothing parameters, we chose to retain a certain level of noise for a more faithful representation of the data, although this makes the maps slightly harder to interpret at first glance.
Even so, these shot maps do not provide a complete picture. Distance to the goal correlates with goal percentage, but the shot type and numerous other factors are also influential. A map of goal rate would be more informative than a map of shots, since goals ultimately determine the game’s outcome, and a team that shoots frequently with low accuracy won’t excel. We also need to consider the defensive zone, as a well-rounded team excels in both offense and defense.